Introduction to Open Data Science - Course Project

Chapter 1: Introduction

About the project


Hi, I am Garima. In the previous period I was a student of Quantitative Skills, and that is where I heard about this course. As I understand it, this course will help me explore data more freely and enhance my skills in wrangling data in a more professional and precise manner. As this is an advanced-level course, I believe it will also help me with my master's thesis, especially when working out my methodology.

My favorite chapters from the RHDS book are Chapters 3, 4 and 5, as they are the ones that truly helped me with my basics. I especially enjoy using the mutate function and visualizing different plots.

Git Repository

My GitHub repository link: https://github.com/garimahelsinki/IODS-project.git

# This is a so-called "R chunk" where you can write R code.

date()
## [1] "Fri Nov 25 15:28:41 2022"

The text continues here.


Chapter 2: Multiple Regression Analysis


date()
## [1] "Fri Nov 25 15:28:41 2022"

Here we go again…

setwd("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data")

library(readr)
Data2 <- read_csv("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data/learning2014.csv",show_col_types = FALSE)
Data2
## # A tibble: 166 × 7
##    gender   age attitude  deep  stra  surf points
##    <chr>  <dbl>    <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 F         53      3.7  3.58  3.38  2.58     25
##  2 M         55      3.1  2.92  2.75  3.17     12
##  3 F         49      2.5  3.5   3.62  2.25     24
##  4 M         53      3.5  3.5   3.12  2.25     10
##  5 M         49      3.7  3.67  3.62  2.83     22
##  6 F         38      3.8  4.75  3.62  2.42     21
##  7 M         50      3.5  3.83  2.25  1.92     21
##  8 F         37      2.9  3.25  4     2.83     31
##  9 M         37      3.8  4.33  4.25  2.17     24
## 10 F         42      2.1  4     3.5   3        26
## # … with 156 more rows

Observation The data contains 166 rows and 7 columns: gender, age, attitude, deep, stra, surf and points.

Visualizations with ggplot2

# Access the ggplot2 library
library(ggplot2)

# initialize plot with data and aesthetic mapping
p1 <- ggplot(Data2, aes(x = attitude, y = points, col = gender))


# define the visualization type (points)
p2 <- p1 + geom_point()

# draw the plot
p2

# add a regression line
p3 <- p2 + geom_smooth(method = "lm")

p3
## `geom_smooth()` using formula 'y ~ x'

# add a main title
p4 <- p3 + ggtitle("Student's attitude versus exam points")

# draw the plot!

p4
## `geom_smooth()` using formula 'y ~ x'

Exploring data frame

pairs(Data2[-1])

# access the GGally and ggplot2 libraries
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(ggplot2)

# create a more advanced plot matrix with ggpairs()

p <- ggpairs(Data2, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

#draw the plot

p

Interpretation of the correlation plot above In the plot, the variable names are displayed on the outer edges of the matrix. The boxes along the diagonal display the density plot of each variable, the boxes in the lower-left corner display the scatterplot for each pair of variables, and the boxes in the upper-right corner display the Pearson correlation coefficient for each pair.

The Pearson correlation measures the linear relationship between two variables. It takes values between -1 and 1, where -1 signifies a total negative linear correlation, 0 signifies no linear correlation, and +1 signifies a total positive linear correlation.
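As a quick sanity check, the Pearson coefficient can be computed from its definition and compared against R's built-in cor(). The small vectors here are toy values, not the learning2014 data:

```r
# Pearson correlation from its definition: covariance over the
# product of the standard deviations (toy vectors, not the course data)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

all.equal(r_manual, cor(x, y))
## [1] TRUE
```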

We can see that deep has essentially no correlation with points or age. surf, in turn, does not show a strong linear relationship with any of the other variables: its correlations are negative but far from -1, so clear linearity is missing.

Simple regression

# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = Data2) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'

# fit a linear model
my_model <- lm(points ~ attitude, data = Data2)

# print out a summary of the model
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Interpretation The p-value (4.12e-09) suggests that there is a significant positive relationship between attitude and exam points: each one-unit increase in attitude is associated with about 3.5 more points.

Multiple regression

library(GGally)
library(ggplot2)
# create an plot matrix with ggpairs()
ggpairs(Data2, lower = list(combo = wrap("facethist", bins = 20)))

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = Data2)

# print out a summary of the model
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude + stra, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09
# add a third explanatory variable, surf

my_model3 <- lm(points ~ attitude + stra + surf, data = Data2)

#printing out a summary of the model
summary(my_model3)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Interpretation of multiple regression Attitude remains a significant predictor: a one-unit increase in attitude is associated with roughly a 3.4-point increase in exam points, holding the other variables constant. The coefficients for stra and surf are not statistically significant at the 0.05 level (p = 0.117 and p = 0.466 in the three-variable model).

Graphical model validation

# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = Data2)

# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5

plot(my_model2, c(1,2,5))

Interpretation of residual models

1. Residuals vs Fitted Linearity seems to hold reasonably well, as the red line stays close to the dashed horizontal line. Some heteroskedasticity is visible: moving right along the x-axis, the spread of the residuals appears to increase. Finally, observations 145, 56 and 35 have large residuals and may be outliers.

2. Normal QQ-plot In this plot, the quantiles of the standard normal distribution are on the x-axis and the corresponding quantiles of the standardized residuals on the y-axis. Most of the points fall close to the 45-degree reference line, so the normality assumption is reasonable, but there are discrepancies at both tails of the plot, especially for observations 145, 35 and 56.

3. Residuals vs Leverage plot This plot shows how the spread of the standardized residuals changes as the leverage (the sensitivity of the fitted value ŷ_i to a change in y_i) increases. Points with high leverage may be influential. For this we can look at Cook's distance, which measures the effect that deleting a point has on the fitted parameter vector; Cook's distance contours appear as dashed red lines, and points beyond them have high influence. In this case there are no such influential points.
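The influence diagnostics behind plot 5 can also be inspected numerically with cooks.distance(). A minimal sketch with simulated data (the chapter's own model would use my_model2 and Data2 instead):

```r
# simulate a small fit with one deliberately influential point
set.seed(1)
x <- c(rnorm(20), 10)               # observation 21 has high leverage
y <- c(2 * x[1:20] + rnorm(20), 0)  # ...and a large residual
fit <- lm(y ~ x)

# Cook's distance for each observation; the influential point stands out
cd <- cooks.distance(fit)
which.max(cd)
## 21 
## 21
```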

Making predictions

# Create model object m
m <- lm(points ~ attitude, data = Data2)

# print out a summary of the model
summary(m)
## 
## Call:
## lm(formula = points ~ attitude, data = Data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
# New observations
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)

# Print out the new data
summary(new_data)
##     attitude    
##  Min.   :2.200  
##  1st Qu.:2.725  
##  Median :3.350  
##  Mean   :3.325  
##  3rd Qu.:3.950  
##  Max.   :4.400
# Predict the new students exam points based on attitude
predict(m, newdata = new_data) 
##      Mia     Mike   Riikka    Pekka 
## 25.03390 27.14918 19.39317 21.86099

Interpretation The predict() function applies the fitted model to the new students' attitude scores. Each one-unit increase in attitude raises the predicted exam points by about 3.5 (the slope), so Mike, with the highest attitude (4.4), receives the highest predicted score.
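These predictions can be reproduced by hand from the coefficient table above (points = intercept + slope × attitude); small differences in the last decimals come from rounding the printed coefficients:

```r
# predicted exam points computed from the printed coefficients of model m
intercept <- 11.6372
slope <- 3.5255

new_attitudes <- c(Mia = 3.8, Mike = 4.4, Riikka = 2.2, Pekka = 2.9)
round(intercept + slope * new_attitudes, 2)
##    Mia   Mike Riikka  Pekka 
##  25.03  27.15  19.39  21.86
```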

End of chapter


Chapter 3: Logistic Regression Analysis

date()
## [1] "Fri Nov 25 15:29:01 2022"

About the Data set

The data was collected for the following paper: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.

The secondary student achievement of two Portuguese schools is the focus of these data. The data variables were gathered utilizing school reports and surveys and include student grades, demographic, social, and school-related features. Regarding the performance in two different courses, Mathematics (mat) and Portuguese language(por), two datasets are supplied . The two datasets were modeled in [Cortez and Silva, 2008] under binary/five-level classification and regression tasks. Please take note that the target attribute G3 strongly correlates with the traits G2 and G1. This is due to the fact that G3 is the final year grade (given at the third period), but G1 and G2 are the grades for the first and second periods, respectively.

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira )

2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)

3 age - student’s age (numeric: from 15 to 22)

4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)

5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)

6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)

7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)

9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)

11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)

12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no)

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)

19 activities - extra-curricular activities (binary: yes or no)

20 nursery - attended nursery school (binary: yes or no)

21 higher - wants to take higher education (binary: yes or no)

22 internet - Internet access at home (binary: yes or no)

23 romantic - with a romantic relationship (binary: yes or no)

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)

29 health - current health status (numeric: from 1 - very bad to 5 - very good)

30 absences - number of school absences (numeric: from 0 to 93)

These grades are related with the course subject, Math or Portuguese:

31 G1 - first period grade (numeric: from 0 to 20)

31 G2 - second period grade (numeric: from 0 to 20)

32 G3 - final grade (numeric: from 0 to 20, output target)

Reading data

setwd("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data")

library(readr)
alc <- read_csv("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data/alc.csv",show_col_types = FALSE)
alc
## # A tibble: 370 × 41
##    school sex     age address famsize Pstatus  Medu  Fedu Mjob     Fjob   reason
##    <chr>  <chr> <dbl> <chr>   <chr>   <chr>   <dbl> <dbl> <chr>    <chr>  <chr> 
##  1 GP     F        18 U       GT3     A           4     4 at_home  teach… course
##  2 GP     F        17 U       GT3     T           1     1 at_home  other  course
##  3 GP     F        15 U       LE3     T           1     1 at_home  other  other 
##  4 GP     F        15 U       GT3     T           4     2 health   servi… home  
##  5 GP     F        16 U       GT3     T           3     3 other    other  home  
##  6 GP     M        16 U       LE3     T           4     3 services other  reput…
##  7 GP     M        16 U       LE3     T           2     2 other    other  home  
##  8 GP     F        17 U       GT3     A           4     4 other    teach… home  
##  9 GP     M        15 U       LE3     A           3     2 services other  home  
## 10 GP     M        15 U       GT3     T           3     4 other    other  home  
## # … with 360 more rows, and 30 more variables: guardian <chr>,
## #   traveltime <dbl>, studytime <dbl>, schoolsup <chr>, famsup <chr>,
## #   activities <chr>, nursery <chr>, higher <chr>, internet <chr>,
## #   romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>,
## #   Walc <dbl>, health <dbl>, failures <dbl>, paid <chr>, absences <dbl>,
## #   G1 <dbl>, G2 <dbl>, G3 <dbl>, alc_use <dbl>, high_use <lgl>,
## #   probability <dbl>, prediction <lgl>, prob2 <dbl>, prob3 <lgl>, …

Observation The data set, named here "alc", has 370 observations. The file contains 41 columns: the study variables described above plus derived columns, such as alc_use, high_use and prediction columns saved from earlier runs.

Visualising data through plots

# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# glimpse at the alc data
glimpse(alc)
## Rows: 370
## Columns: 41
## $ school        <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G…
## $ sex           <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "…
## $ age           <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, …
## $ address       <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "…
## $ famsize       <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", …
## $ Pstatus       <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "…
## $ Medu          <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,…
## $ Fedu          <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,…
## $ Mjob          <chr> "at_home", "at_home", "at_home", "health", "other", "ser…
## $ Fjob          <chr> "teacher", "other", "other", "services", "other", "other…
## $ reason        <chr> "course", "course", "other", "home", "home", "reputation…
## $ guardian      <chr> "mother", "father", "mother", "mother", "father", "mothe…
## $ traveltime    <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,…
## $ studytime     <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,…
## $ schoolsup     <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",…
## $ famsup        <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye…
## $ activities    <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", …
## $ nursery       <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "…
## $ higher        <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", …
## $ internet      <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye…
## $ romantic      <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "…
## $ famrel        <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,…
## $ freetime      <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,…
## $ goout         <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,…
## $ Dalc          <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ Walc          <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,…
## $ health        <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,…
## $ failures      <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid          <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes…
## $ absences      <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9,…
## $ G1            <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, …
## $ G2            <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15,…
## $ G3            <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16…
## $ alc_use       <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1…
## $ high_use      <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ probability   <dbl> 0.2699557, 0.2894543, 0.2291836, 0.3126444, 0.2858545, 0…
## $ prediction    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ prob2         <dbl> 0.3555814, 0.2953958, 0.2997191, 0.3040783, 0.2822524, 0…
## $ prob3         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ probability_2 <dbl> 0.1849034, 0.1533271, 0.1849034, 0.2569653, 0.2001637, 0…
## $ prediction_2  <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse
## Rows: 15,170
## Columns: 2
## $ key   <chr> "school", "school", "school", "school", "school", "school", "sch…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
# it may help to take a closer look by View() and browse the data
gather(alc) %>% View

# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()

Above are bar plots of each variable in the data set, showing how each variable is distributed.

About my 4 chosen variables

I am choosing the following variables: schoolsup - extra educational support (binary: yes or no); higher - wants to take higher education (binary: yes or no); famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent); and G3 - final grade (numeric: from 0 to 20, output target).

The variable ‘schoolsup’ will help me understand the impact of high alcohol use on how much support students need in their studies. My hypothesis is that with increased alcohol use, more support will be needed.

With the help of the variable ‘higher’, I want to study whether the willingness to pursue higher studies differs between those who consume more alcohol and the others. My hypothesis is that increased alcohol consumption will have an effect on the desire to pursue higher education, but not a significant one.

The ‘famrel’ variable will help me understand the effect of high alcohol use on family relationships. My hypothesis is that high consumption will strain the relationship with family.

The final grade variable (G3) will help me see whether the final grades of students with high alcohol use differ from those with low use. My hypothesis is that grades will be negatively affected by high alcohol consumption.

Exploring the chosen variables through visualization

1. Exploring high_use with schoolsup, i.e., extra support needed from school

# access the tidyverse libraries dplyr and ggplot2
library(dplyr); library(ggplot2)

g1 <- ggplot(alc, aes(x = high_use))

g1 + geom_bar() + facet_wrap("schoolsup") + ggtitle("Effect of alcohol consumption on Educational support required") + xlab("High alcohol usage") + aes(fill = high_use)

Observation From the visualization above we can observe that comparatively few students both use alcohol heavily and require extra educational support. Interestingly, students whose alcohol use is not high seek more support from the school than those with high use. My hypothesis turns out to be false here, which I find very interesting.

2. Exploring high_use with higher, i.e., desire for higher education

g2 <- ggplot(alc, aes(x = high_use))

g2 + geom_bar() + facet_wrap("higher") + ggtitle("Desire to take higher education vs high alcohol usage") + xlab("High alcohol usage") + aes(fill = high_use)

Observation This supports my hypothesis: no matter how much alcohol is consumed, the desire for higher education does not go away. Let's take a closer look below.

Summarising to see the results more clearly

alc %>% group_by(higher, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'higher'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   higher [2]
##   higher high_use count
##   <chr>  <lgl>    <int>
## 1 no     FALSE        7
## 2 no     TRUE         9
## 3 yes    FALSE      252
## 4 yes    TRUE       102

Observation Only 16 students do not wish to pursue higher education, of whom 9 are high users. This suggests no significant relationship between alcohol use and the desire to study further, which seems logical to me: drinking habits can be temporary, while the choice of further studies is a well-thought-through decision for most people.

3. Exploring high_use and its effect on famrel, i.e., family relationships

g3 <- ggplot(alc, aes(x = high_use))

g3 + geom_bar() + facet_wrap("famrel") + ggtitle("family relationships (1-5) vs high alcohol usage") + xlab("High alcohol usage") + aes(fill = high_use)

Observation Family relationships for most students are on the positive side, without any clear difference between high and low alcohol use. In fact, in the bad-relationship categories the count of low users is higher, but that simply reflects there being more low users than high users overall. Therefore my hypothesis of a negative effect turns out to be false.

4. Exploring high_use and its effect on G3, i.e., the student's final grade

g4 <- ggplot(alc, aes(x = high_use, y = G3))

g4 + geom_boxplot() + ggtitle("Effect of high alcohol usage on Final grades") + xlab("High alcohol usage") + ylab("Final grade")+ aes(col = high_use)

Observation We can clearly see that students with high consumption have somewhat lower grades than those with low consumption. The hypothesis for this variable holds: high alcohol consumption is associated with lower grades.

Logistic Regression Model

About Logistic regression model

Logistic regression is used to predict a class (or category) from one or more predictor variables (x). It models a binary outcome, that is, a variable that can take only two possible values.

We will now use a logistic regression model to identify factors related to higher-than-average student alcohol consumption.

# find the model with glm()
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ schoolsup + higher + famrel + G3, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4957  -0.8206  -0.7292   1.3560   1.9346  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   1.83332    0.74905   2.448   0.0144 *
## schoolsupyes -0.37002    0.35929  -1.030   0.3031  
## higheryes    -0.78379    0.55077  -1.423   0.1547  
## famrel       -0.27320    0.12418  -2.200   0.0278 *
## G3           -0.07269    0.03692  -1.969   0.0490 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 437.36  on 365  degrees of freedom
## AIC: 447.36
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
##  (Intercept) schoolsupyes    higheryes       famrel           G3 
##   1.83331647  -0.37001600  -0.78378621  -0.27320431  -0.07269305

Observation Looking at the p-values, only famrel (p = 0.028) and G3 (p = 0.049) are statistically significant at the 0.05 level; schoolsup and higher are not. All four coefficients are negative, so better family relationships and higher final grades are associated with lower odds of high alcohol use.

There were 4 Fisher scoring iterations before the model-fitting process converged and the results were output.

Note that all the slope estimates are negative (while the intercept is positive), which matches the direction of the associations described above.

From coefficients to odds ratios (OR)

# find the model with glm()
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR     2.5 %     97.5 %
## (Intercept)  6.2545955 1.4672230 28.1361910
## schoolsupyes 0.6907233 0.3292951  1.3616633
## higheryes    0.4566737 0.1504271  1.3477672
## famrel       0.7609373 0.5955510  0.9708981
## G3           0.9298862 0.8644588  0.9995539

Observation The odds ratios suggest that receiving extra educational support is associated with about 31% lower odds of high alcohol use, wanting to pursue higher education with about 54% lower odds, each one-point improvement in family relationships with about 24% lower odds, and each additional grade point with about 7% lower odds. However, the confidence intervals for schoolsupyes and higheryes include 1, so those associations are not statistically significant.
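The percentage interpretations can be read directly off the odds ratios: an OR below 1 corresponds to a (1 − OR) × 100 percent drop in the odds per one-unit increase. Using the values printed above:

```r
# odds ratios copied from the model output above
OR <- c(schoolsupyes = 0.6907233, higheryes = 0.4566737,
        famrel = 0.7609373, G3 = 0.9298862)

# percentage change in the odds of high_use per one-unit increase
round((OR - 1) * 100, 1)
## schoolsupyes    higheryes       famrel           G3 
##        -30.9        -54.3        -23.9         -7.0
```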

Plot visualisation with confidence intervals

library(finalfit)
dependent <- "high_use"
explanatory <- c("schoolsup", "higher", "famrel", "G3")
alc %>% 
  or_plot(dependent, explanatory,
          breaks = c(0.5, 1, 2, 5),
          table_text_size = 3.5,
          title_text_size = 16)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Warning: Removed 2 rows containing missing values (geom_errorbarh).

The results of the plot above match the observations.

Explore the predictive power of the model with the significant variables

Binary prediction (1)

# fit the model
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
m1 <- glm(high_use ~ famrel + G3, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")
prob2 <- predict(m1, type = "response")

library(dplyr)

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prob2 = prob2)

alc <- mutate(alc, prediction = probability > 0.5)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prob3 = prob2 > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, schoolsup, higher, famrel, G3, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 7
##    schoolsup higher famrel    G3 high_use probability prediction
##    <chr>     <chr>   <dbl> <dbl> <lgl>          <dbl> <lgl>     
##  1 no        yes         4    13 FALSE          0.271 FALSE     
##  2 no        yes         4     0 FALSE          0.489 FALSE     
##  3 no        yes         5     2 TRUE           0.387 FALSE     
##  4 no        yes         5    12 FALSE          0.233 FALSE     
##  5 no        yes         4     8 FALSE          0.349 FALSE     
##  6 no        yes         5     5 FALSE          0.336 FALSE     
##  7 no        yes         4    12 FALSE          0.286 FALSE     
##  8 no        yes         1     4 FALSE          0.619 TRUE      
##  9 no        yes         2    13 TRUE           0.391 FALSE     
## 10 no        yes         4    10 TRUE           0.316 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   251    8
##    TRUE    103    8

Observation The model predicts FALSE correctly for 251 of the 259 students who are not high users, but TRUE correctly for only 8 of the 111 high users. The sensitivity is therefore 8/111 ≈ 0.07, which is poor, while the specificity is 251/259 ≈ 0.97, which is much better.

In the cross table, 8 times the prediction is heavy alcohol consumption when the true value corresponds to no heavy consumption (false positives). Similarly, 103 times the prediction is no heavy consumption when the true value corresponds to heavy consumption (false negatives).
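Sensitivity and specificity can be computed directly from the cross table; the four counts below are copied from the table above:

```r
# counts from the confusion matrix above (rows = truth, cols = prediction)
tp <- 8    # high_use TRUE,  predicted TRUE
fn <- 103  # high_use TRUE,  predicted FALSE
tn <- 251  # high_use FALSE, predicted FALSE
fp <- 8    # high_use FALSE, predicted TRUE

sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
accuracy    <- (tp + tn) / (tp + fn + tn + fp)

round(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy), 3)
## sensitivity specificity    accuracy 
##       0.072       0.969       0.700
```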

# refit the model and recompute the predictions so the objects below are consistent

m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)

Binary prediction (2)

# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use))

# define the geom as points and draw the plot
g + geom_point() + aes(col = probability)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67837838 0.02162162 0.70000000
##    TRUE  0.27837838 0.02162162 0.30000000
##    Sum   0.95675676 0.04324324 1.00000000

Observation The table above shows that the model predicts about 96 percent of all students to be non-heavy users (column sum 0.957), while in reality about 70 percent of them are non-heavy users (row sum 0.70). Note again that the predictions differ from the observed values of the variable.
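As a quick check, the share of correct classifications is the sum of the diagonal of the proportion table above, and the training error is its complement (the two values below are copied by hand from the printed table):

```r
# Diagonal entries of the proportion table: correct FALSE + correct TRUE
acc <- 0.67837838 + 0.02162162
err <- 1 - acc
c(accuracy = acc, error = err)   # error = 0.3
```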

Accuracy and loss functions

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, alc$probability)
## [1] 0.3

Observation The loss function gives a training prediction error of 0.3, i.e. 30% of the observations are misclassified.

10-fold Cross validation

# K-fold cross-validation
library(boot)

cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)


# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3081081

Observation Since a lower value is better, this result (≈0.31) is somewhat worse than the result in the exercise, which was about 0.26.

To improve the model, the set of explanatory variables can be changed and the cross-validation recalculated. Here I replace G3 with sex as an explanatory variable, since it not only relates strongly to the target variable but is also interesting to examine from a gender perspective.

n <- glm(high_use ~ schoolsup + higher + famrel + sex, data = alc, family = "binomial")

# Summary of the model
summary(n)
## 
## Call:
## glm(formula = high_use ~ schoolsup + higher + famrel + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4855  -0.8633  -0.6683   1.2409   1.9794  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    0.7570     0.7207   1.050 0.293510    
## schoolsupyes  -0.0982     0.3675  -0.267 0.789324    
## higheryes     -0.8484     0.5327  -1.593 0.111222    
## famrel        -0.3235     0.1264  -2.559 0.010489 *  
## sexM           0.9137     0.2427   3.765 0.000167 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 426.62  on 365  degrees of freedom
## AIC: 436.62
## 
## Number of Fisher Scoring iterations: 4
# Odds ratios and their confidence intervals
cbind(exp(coef(n)), exp(confint(n)))
## Waiting for profiling to be done...
##                            2.5 %    97.5 %
## (Intercept)  2.1319589 0.5214475 8.9591639
## schoolsupyes 0.9064662 0.4264470 1.8220117
## higheryes    0.4280849 0.1454536 1.2158699
## famrel       0.7236339 0.5634688 0.9265524
## sexM         2.4935013 1.5567858 4.0385316
# Prediction of the probability
probabilities_2 <- predict(n, type = "response")
# Probabilities to alc
alc <- mutate(alc, probability_2 = probabilities_2)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction_2 = probability_2 > 0.5)
# Recalculation of cross-validation and print the average number of wrong predictions.
cv_2 <- cv.glm(data = alc, cost = loss_func, glmfit = n, K = 10)
cv_2$delta[1]
## [1] 0.2972973

Interpretation The result of about 0.297 is slightly better than the 0.308 obtained with the original model. (Note that K-fold cross-validation involves random splits, so the exact values vary between runs.)

**____End of Chapter___**


Chapter 4: Clustering and classification

date()
## [1] "Fri Nov 25 15:29:16 2022"

About the Data set

The Boston data set has 506 observations and 14 variables. It is one of the data sets shipped with the MASS package and contains information on housing values in the suburbs of Boston.

The variables involved are as follows:

crim - per capita crime rate by town.

zn - proportion of residential land zoned for lots over 25,000 sq.ft.

indus - proportion of non-retail business acres per town.

chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox - nitrogen oxides concentration (parts per 10 million).

rm - average number of rooms per dwelling.

age - proportion of owner-occupied units built prior to 1940.

dis - weighted mean of distances to five Boston employment centres.

rad - index of accessibility to radial highways.

tax - full-value property-tax rate per $10,000.

ptratio - pupil-teacher ratio by town.

black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat - lower status of the population (percent).

medv - median value of owner-occupied homes in $1000s.

Source Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics. Identifying Influential Data and Sources of Collinearity. New York: Wiley.

Loading the Data Set, Installing package and Exploring Data set

#Loading required libraries for the Exercise

library(ggplot2)
library(dplyr)
library(corrplot)
## corrplot 0.92 loaded
library(GGally)
library(tidyr)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
#Loading Data Set
data("Boston")

#Exploring the Data Set
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

Observation: The data set has 506 observations of 14 variables. The variables chas and rad are integers while the others are numeric. Comparing the quartiles with the Min. and Max. values in the summary shows that most variables are skewed; rm and medv are closest to symmetric.

Graphical Distribution and summary of variables

# plot matrix of the variables
pairs(Boston)

# printing the correlation matrix to look at the correlations between variables in the data
cor_matrix <- cor(Boston) %>% round(digits = 2)

# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

Observations

Pair Plot: The pair plot visualizes the pairwise relationships between the variables and confirms the skewed distributions noted above.

Correlation Plot: According to the correlation plot, many variable pairs are strongly positively correlated; for example, ‘rad’ and ‘tax’ have the highest positive correlation of all. We can also see strong negative correlations, the strongest being between ‘lstat’ and ‘medv’, ‘age’ and ‘dis’, ‘nox’ and ‘dis’, and ‘indus’ and ‘dis’. The variable ‘dis’ (distance) is therefore involved in most of the strongest negative correlations.
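Beyond reading the plot by eye, one illustrative way (not part of the exercise) to list the strongest pairwise correlations programmatically is to flatten the correlation matrix into a long table:

```r
library(MASS)  # for the Boston data

# Turn the correlation matrix into a long table of unique variable pairs
cors <- as.data.frame(as.table(cor(Boston)))
cors <- subset(cors, as.character(Var1) < as.character(Var2))

# The five strongest correlations in absolute value
head(cors[order(-abs(cors$Freq)), ], 5)
```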

Standardization of Data Set and Categorization of Variable

  1. Standardization of the data set Standardization subtracts the column mean from each value and divides the result by the column’s standard deviation, so every variable ends up with mean 0 and standard deviation 1.
# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(scale(Boston))

summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

Observation After standardization, every variable has mean 0 and standard deviation 1. Note that scaling changes only the location and spread of the variables, not the shape of their distributions, so skewed variables remain skewed.
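A quick sanity check on a toy vector illustrates that scale() does exactly this — subtract the mean, divide by the standard deviation:

```r
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)           # standardize by hand
all.equal(as.numeric(scale(x)), manual)   # TRUE: scale() gives the same result
c(mean = mean(manual), sd = sd(manual))   # mean 0, sd 1
```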

2. Creating a categorical variable of the crime rate in the Boston data set

# Create a quantile vector of crim and create the categorical "crime".
bins <- quantile(boston_scaled$crim)

crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))

# Remove the original unscaled variable and add the new categorical value to scaled data.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

# Look at the table of the new factor crime
table(boston_scaled$crime)
## 
##      low  med_low med_high     high 
##      127      126      126      127

Observation We have replaced the continuous variable crim with a categorical variable crime, which is now a separate column in the boston_scaled data set. The quantile-based bins contain 127 observations for low, 126 for med_low, 126 for med_high and 127 for high crime rates.
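On a toy vector, the same quantile-based binning looks like this (illustrative only — the variable names here are made up):

```r
v <- 1:12
bins <- quantile(v)                # min, Q1, median, Q3, max
cats <- cut(v, breaks = bins, include.lowest = TRUE,
            labels = c("low", "med_low", "med_high", "high"))
table(cats)                        # three values fall into each bin
```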

Divide the dataset to train and test sets.

When we want to use a statistical method to predict something, it is important to have data to test how well the predictions fit. Splitting the original data to test and train sets allows us to check how well our model works. The training of the model is done with the train set and prediction on new data is done with the test set. This way you have true classes / labels for the test data, and you can calculate how well the model performed in prediction.

Below, size = n * 0.8 means that 80% of the rows are chosen at random for the training set; the remaining 20% form the test set.

# Number of rows in the Boston dataset
n <- nrow(boston_scaled) 

# Choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# Create train set
train <- boston_scaled[ind,]

# Create test set
test <- boston_scaled[-ind,] 

# Save the correct classes from the test data
correct_classes <- test$crime

# Remove the crime variable from the test data
test <- dplyr::select(test, -crime)
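Since sample() is random, the split (and everything downstream) differs every time the document is knitted; fixing the RNG seed before sampling would make it reproducible (optional — the seed value 2022 below is an arbitrary choice):

```r
set.seed(2022)                    # any fixed integer gives a reproducible split
n   <- 506                        # number of rows, as in boston_scaled
ind <- sample(n, size = n * 0.8)
length(ind)                       # 404 rows go to the training set
```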

Linear Discriminant Analysis on the train set

LDA finds linear combinations of the predictor variables that best separate the classes of the target variable. In this analysis, crime is the target variable.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2623762 0.2326733 0.2376238 0.2673267 
## 
## Group means:
##                   zn      indus        chas        nox          rm        age
## low       0.97543647 -0.8977735 -0.08661679 -0.8852680  0.46287527 -0.8775245
## med_low  -0.06462671 -0.3262614  0.02085925 -0.5749309 -0.09190006 -0.3426089
## med_high -0.39612627  0.2502745  0.17879700  0.4436498  0.05460387  0.3699489
## high     -0.48724019  1.0169921 -0.09005591  1.0622579 -0.36757432  0.8271536
##                 dis        rad        tax     ptratio       black       lstat
## low       0.8824996 -0.6958379 -0.7677688 -0.45095283  0.37857765 -0.77564271
## med_low   0.3438926 -0.5505852 -0.5134479 -0.04678098  0.31754146 -0.15753654
## med_high -0.3923216 -0.3980671 -0.2845421 -0.32396516  0.06635014  0.05756895
## high     -0.8571271  1.6393984  1.5149640  0.78225547 -0.78397836  0.88251893
##                medv
## low       0.5420313
## med_low   0.0346041
## med_high  0.1663234
## high     -0.7215869
## 
## Coefficients of linear discriminants:
##                 LD1          LD2         LD3
## zn       0.08447768  0.684782154 -0.84449575
## indus    0.01780437 -0.235238688  0.19137881
## chas    -0.08093902 -0.046756104  0.13910844
## nox      0.34259248 -0.758739887 -1.33895099
## rm      -0.06157118 -0.043640551 -0.16232800
## age      0.31702220 -0.209611717 -0.09007108
## dis     -0.05161374 -0.231216406  0.03770089
## rad      3.10420071  1.043655642 -0.14326266
## tax      0.10288507 -0.137064991  0.54008441
## ptratio  0.11563945  0.007061256 -0.10758522
## black   -0.11256833  0.022090876  0.12254221
## lstat    0.25126096 -0.270506395  0.30521965
## medv     0.18675535 -0.452447724 -0.22984616
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9547 0.0353 0.0100
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(factor(train$crime))

# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 1)

Observation The proportion of trace shows that LD1 explains about 95.5% of the between-group variance, LD2 about 3.5% and LD3 about 1.0%. The plot shows that the crime classes are fairly well separated, and the long arrow for ‘rad’ indicates that accessibility to radial highways contributes most to the separation.

Prediction and Cross tabulation

(Please note: the saving of the crime categories and the removal of the categorical crime variable from the test data set can be found just before the LDA exercise, in the “Divide the dataset to train and test sets” section.)

# Predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       12       8        1    0
##   med_low    4      21        7    0
##   med_high   0      13       16    1
##   high       0       0        0   19

Observation The cross tabulation shows that the model predicts the ‘high’ class very well (all 19 such observations are classified correctly). Of the 102 test observations, the majority fall on the diagonal, i.e. are predicted correctly, which makes the model usable for rough predictions.
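The overall test-set accuracy can be read off the diagonal of the cross table (the counts below are copied by hand from the printed table):

```r
correct <- 12 + 21 + 16 + 19   # diagonal: correctly classified per class
total   <- 102                 # all test observations
round(correct / total, 2)      # about 0.67
```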

Reload and standardize the Boston dataset

# Reload Boston dataset.
data("Boston")

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame from matrix type.
boston_scaled <- as.data.frame(boston_scaled)

Calculate the distance between Observations, K means clustering and visualisation

# Calculate the Euclidean distances between observations.
dist_eu <- dist(boston_scaled, method = "euclidean")

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970

Observation Euclidean distance is the most common distance measure. Here the distances between observations range from 0.1343 to 14.3970, with a mean of 4.9111 and a median of 4.8241.
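For two observations, the Euclidean distance is simply the square root of the summed squared coordinate differences; a minimal illustration:

```r
a <- c(0, 0); b <- c(3, 4)
sqrt(sum((a - b)^2))        # 5, the classic 3-4-5 triangle
dist(rbind(a, b))           # dist() gives the same value
```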

# K-means clustering
km <- kmeans(boston_scaled, centers = 3)

# plot the Boston dataset with clusters
pairs(boston_scaled[6:10], col = km$cluster)

Observation Here I have run k-means with 3 clusters and plotted five of the variables (columns 6 to 10) colored by cluster membership.

We will now find the optimal number for clusters

# Investigate optimal number of clusters and rerun algorithm
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# Visualize the cluster.
qplot(x = 1:k_max, y = twcss, geom = 'line')

Observation The total within-cluster sum of squares (TWCSS) drops radically when moving from one to two clusters, after which the decrease levels off. This suggests that two is a good number of clusters.

Let us now check how the clustering looks with the optimal number of clusters, 2.

# k-means clustering on the scaled data with two centers
km <- kmeans(boston_scaled, centers = 2)

# Plot the clusters
pairs(boston_scaled[6:10], col = km$cluster)

Observation Compared with the earlier 3-cluster solution the difference is not dramatic, but based on the TWCSS results, two seems to be the optimal number of clusters for these data.

Super Bonus: Create a 3-D plot

model_predictors <- dplyr::select(train, -crime)

# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Plot.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# k-means with 4 clusters to compare the methods.
km3D <-kmeans(boston_scaled, centers = 4)


#Plot with km cluster as color of data
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km3D$cluster[ind])

Observation

From the above super cool 3D plots, my observation is that the k-means clustering separates the groups more clearly. In the crime-colored plot it was harder to tell where the classes fall, as some of them overlap considerably.

**____End of chapter___**

(more chapters to be added similarly as we proceed with the course!)